Classification - Instacart Product Reorder Prediction
Predicting whether a customer will reorder products in their next order using the
Instacart Market Basket Analysis dataset.
Dataset Source: Kaggle Instacart Market Basket Analysis
Problem Type: Classification
Target Variable: Binary prediction of whether a product will be reordered (1) or not (0)
Use Case: E-commerce recommendation systems, inventory management, personalized marketing campaigns
Package Imports
!pip install xplainable
!pip install xplainable-client

import pandas as pd
import xplainable as xp
from xplainable.core.models import XClassifier
from xplainable.core.optimisation.bayesian import XParamOptimiser
from sklearn.model_selection import train_test_split
import requests
import json

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc

import xplainable_client
Xplainable Cloud Setup
client = xplainable_client.Client(
    api_key="",
)
Data Loading and Exploration
Load the Instacart Market Basket Analysis datasets.
Note: Download the datasets from Kaggle and extract the CSV files, using one of the options below.
Dataset Download Instructions
Option 1: Direct Download (Recommended for beginners)
- Visit the Kaggle Instacart Market Basket Analysis competition
- Create a free Kaggle account if you don't have one
- Download the dataset ZIP file (~200MB)
- Extract all CSV files to your working directory
Option 2: Kaggle API (Recommended for experienced users)
pip install kaggle
kaggle competitions download -c instacart-market-basket-analysis
unzip instacart-market-basket-analysis.zip
Note: The dataset contains 6 CSV files with about 3.4 million orders and over 32 million
order-product rows.
try:
    orders = pd.read_csv('orders.csv')
    order_products_train = pd.read_csv('order_products__train.csv')
    order_products_prior = pd.read_csv('order_products__prior.csv')
    products = pd.read_csv('products.csv')
    aisles = pd.read_csv('aisles.csv')
    departments = pd.read_csv('departments.csv')

    print("Datasets loaded successfully!")
    print(f"Orders: {orders.shape}")
    print(f"Order Products (Train): {order_products_train.shape}")
    print(f"Order Products (Prior): {order_products_prior.shape}")
    print(f"Products: {products.shape}")
    print(f"Aisles: {aisles.shape}")
    print(f"Departments: {departments.shape}")

except FileNotFoundError as e:
    print(f"Error loading files: {e}")
    print("Please ensure you have downloaded and extracted the Instacart dataset files.")
    print("Files should be in the same directory as this notebook.")
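Before reading the files, it can help to check which of them are actually present, since `pd.read_csv` stops at the first missing file. The following is a minimal sketch, assuming the six file names used in the loading cell; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced only for this illustration.

```python
from pathlib import Path

# File names as used in the loading cell above.
EXPECTED_FILES = [
    "orders.csv",
    "order_products__train.csv",
    "order_products__prior.csv",
    "products.csv",
    "aisles.csv",
    "departments.csv",
]


def missing_files(data_dir="."):
    """Return the expected CSV files that are not present in data_dir (hypothetical helper)."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All six CSV files found.")
```

Listing every missing file at once avoids fixing them one error at a time.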
Inspecting orders dataset
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28 |
Out: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 user_id int64
2 eval_set object
3 order_number int64
4 order_dow int64
5 order_hour_of_day int64
6 days_since_prior_order float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB
Out: order_id 0
user_id 0
eval_set 0
order_number 0
order_dow 0
order_hour_of_day 0
days_since_prior_order 206209
dtype: int64
The days_since_prior_order column has 206,209 missing values. These correspond to each user's first order, which has no prior order to measure from.
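One way to check where the missing values come from is to compare them with `order_number`. The sketch below uses a small made-up frame (`toy_orders`) in place of the real `orders` table; on the real data the same two comparisons could be run directly on `orders`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `orders` table: two users, three orders each.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "order_number": [1, 2, 3, 1, 2, 3],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0, 30.0],
})

is_missing = toy_orders["days_since_prior_order"].isnull()
is_first_order = toy_orders["order_number"] == 1

# True when the missing values are exactly the first orders.
print((is_missing == is_first_order).all())
# True when there is one missing value per user.
print(is_missing.sum() == toy_orders["user_id"].nunique())
```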
Inspecting order_products_train dataset
order_products_train.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |
order_products_train.shape

order_products_train.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting order_products_prior dataset
order_products_prior.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |
order_products_prior.shape

order_products_prior.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting products dataset
| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
products.isnull().sum()
Out: product_id 0
product_name 0
aisle_id 0
department_id 0
dtype: int64
Inspecting aisles dataset
| | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |
| 3 | 4 | instant foods |
| 4 | 5 | marinades meat preparation |
Out: aisle_id 0
aisle 0
dtype: int64
Inspecting departments dataset
| | department_id | department |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |
departments.isnull().sum()
Out: department_id 0
department 0
dtype: int64
Exploratory Data Analysis (EDA)
color = sns.color_palette()  # default seaborn palette, reused by the plots in this section
plt.figure(figsize=(6,4))
sns.countplot(x="order_dow", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by week day", fontsize=15)
plt.show()

More orders are placed on weekends than on weekdays, plausibly because customers are at
home and have more time to shop for groceries.
plt.figure(figsize=(6,4))
sns.countplot(x="order_hour_of_day", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by Hour of day", fontsize=15)
plt.show()

Most orders are placed between 9 AM and 5 PM. Few orders are placed before 7 AM or
after 11 PM.
plt.figure(figsize=(10,6))
sns.countplot(x="days_since_prior_order", data=orders)
plt.xticks(rotation=90)
plt.show()
The most common gap between orders is 30 days. Because days_since_prior_order is capped at
30, this bar also includes longer gaps. The second most common gap is 7 days, which points
to a weekly shopping habit.
products_details = pd.merge(left=products, right=departments, how="left")
products_details = pd.merge(left=products_details, right=aisles, how="left")
products_details.head()
| | product_id | product_name | aisle_id | department_id | department | aisle |
|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | snacks | cookies cakes |
| 1 | 2 | All-Seasons Salt | 104 | 13 | pantry | spices seasonings |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | beverages | tea |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | frozen | frozen meals |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | pantry | marinades meat preparation |
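The two merges above attach lookup values to each product, so the result should keep exactly one row per product. A minimal sketch of how this could be guarded, using made-up frames (`toy_products`, `toy_departments`, `toy_aisles`) and the `validate` argument of `DataFrame.merge`, which raises an error if a key is duplicated in the lookup table:

```python
import pandas as pd

toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Chocolate Sandwich Cookies", "All-Seasons Salt", "Green Chile Anytime Sauce"],
    "aisle_id": [61, 104, 5],
    "department_id": [19, 13, 13],
})
toy_departments = pd.DataFrame({"department_id": [13, 19], "department": ["pantry", "snacks"]})
toy_aisles = pd.DataFrame({
    "aisle_id": [5, 61, 104],
    "aisle": ["marinades meat preparation", "cookies cakes", "spices seasonings"],
})

# Explicit keys plus a many-to-one check on each lookup table.
details = toy_products.merge(toy_departments, on="department_id", how="left", validate="many_to_one")
details = details.merge(toy_aisles, on="aisle_id", how="left", validate="many_to_one")

# A left join against a many-to-one lookup keeps one row per product.
print(len(details) == len(toy_products))
print(details[["product_name", "department", "aisle"]])
```

Naming the join key with `on=` also makes the merge independent of which columns the two frames happen to share.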
plt.figure(figsize=(10,6))
g = sns.countplot(x="department", data=products_details)
g.set_xticklabels(g.get_xticklabels(), rotation=40, ha="right")
plt.show()

Personal care is the department with the most products, followed by snacks.
plt.figure(figsize=(10,6))
top10_aisle = products_details["aisle"].value_counts()[:10].plot(kind="bar", title='Aisles')

The aisle labelled "missing" (products without an assigned aisle) contains the most products.
order_products_name_train = pd.merge(left=order_products_train, right=products.loc[:, ["product_id", "product_name"]], on="product_id", how="left")

# Count reorders per product; the columns are named explicitly so the code does not depend on the pandas version
common_Products = order_products_name_train[order_products_name_train.reordered == 1]["product_name"].value_counts().rename_axis("product_name").reset_index(name="count")
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="count", y="product_name", data=common_Products.head(10))
plt.ylabel('product_name', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Banana is the most frequently reordered product, followed by Bag of Organic Bananas.
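The column names produced by `value_counts().to_frame().reset_index()` differ between pandas 1.x and 2.x, which affects which names can be passed to `sns.barplot`. The sketch below, on made-up data, shows one way to get stable column names in either version by naming the index and the counts explicitly.

```python
import pandas as pd

# Made-up order lines: Banana is reordered three times, Bag of Organic Bananas twice.
toy = pd.DataFrame({
    "product_name": [
        "Banana", "Banana", "Banana", "Banana",
        "Bag of Organic Bananas", "Bag of Organic Bananas",
        "Organic Strawberries",
    ],
    "reordered": [1, 1, 1, 0, 1, 1, 1],
})

counts = (
    toy.loc[toy["reordered"] == 1, "product_name"]
    .value_counts()
    .rename_axis("product_name")   # name the index explicitly
    .reset_index(name="count")     # name the counts column explicitly
)
print(counts)
```

The resulting frame always has the columns `product_name` and `count`, so a plot can refer to them by name.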
order_products_name_train = pd.merge(left=order_products_name_train, right=products_details.loc[:, ["product_id", "aisle", "department"]], on="product_id", how="left")

# Count order lines per aisle, with explicit column names
common_aisle = order_products_name_train["aisle"].value_counts().rename_axis("aisle").reset_index(name="count")
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="count", y="aisle", data=common_aisle.head(10), palette="Blues_d")
plt.ylabel('aisle', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

The fresh vegetables aisle has the highest number of sales, followed by fresh fruits.
# Count order lines per department, with explicit column names
common_aisle = order_products_name_train["department"].value_counts().rename_axis("department").reset_index(name="count")
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="count", y="department", data=common_aisle, palette="Blues_d")
plt.ylabel('department', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Produce and dairy eggs are the two departments with the highest number of sales.
train_data_reordered = order_products_train.groupby(["order_id", "reordered"])["product_id"].apply(list).reset_index()
train_data_reordered = train_data_reordered[train_data_reordered.reordered == 1].drop(columns=["reordered"]).reset_index(drop=True)
train_data_reordered.head()
| | order_id | product_id |
|---|---|---|
| 0 | 1 | [49302, 11109, 43633, 22035] |
| 1 | 36 | [19660, 43086, 46620, 34497, 48679, 46979] |
| 2 | 38 | [21616] |
| 3 | 96 | [20574, 40706, 27966, 24489, 39275] |
| 4 | 98 | [8859, 19731, 43654, 13176, 4357, 37664, 34065... |
1. Data Preprocessing
Feature Engineering and Data Preparation
Create features for user behavior, product characteristics, and user-product
interactions.
del products_details
del order_products_name_train
del common_Products
del common_aisle
del train_data_reordered
gc.collect()
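# Keep all orders of a random 15% of users to reduce memory use; .copy() avoids chained-assignment warnings in the next cell
orders = orders.loc[orders.user_id.isin(orders.user_id.drop_duplicates().sample(frac=0.15, random_state=101))].copy()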
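The sampling step selects users rather than individual orders, so every selected user keeps a complete order history, which the per-user features below rely on. A small sketch on made-up data (`toy_orders`, with a larger `frac` so the toy sample is not empty):

```python
import pandas as pd

# Four users with three orders each.
toy_orders = pd.DataFrame({
    "order_id": range(1, 13),
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})

# Sample user ids, then keep every order that belongs to a sampled user.
sampled_users = toy_orders["user_id"].drop_duplicates().sample(frac=0.5, random_state=101)
subset = toy_orders.loc[toy_orders["user_id"].isin(sampled_users)].copy()

# Each sampled user still has all three orders.
orders_per_user = subset.groupby("user_id")["order_id"].count()
print(orders_per_user)
```

Sampling rows directly would instead leave most users with partial histories and distort counts such as the number of orders per user.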
aisles['aisle'] = aisles['aisle'].astype('category')
departments['department'] = departments['department'].astype('category')
orders['eval_set'] = orders['eval_set'].astype('category')
products['product_name'] = products['product_name'].astype('category')
prior_orders = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prior_orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 |
| 1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 |
| 2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 |
| 3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 |
| 4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 |
Create Features using user_id
users = prior_orders.groupby(by='user_id')['order_number'].aggregate('max').to_frame('num_of_orders_for_each_user').reset_index()
users.head()
| | user_id | num_of_orders_for_each_user |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 2 | 14 |
| 2 | 3 | 12 |
| 3 | 4 | 5 |
| 4 | 5 | 4 |
total_product_per_order = prior_orders.groupby(by=['user_id', 'order_id'])['product_id'].aggregate('count').to_frame('total_products_per_order').reset_index()

avg_number_of_products_per_order = total_product_per_order.groupby(by=['user_id'])['total_products_per_order'].mean().to_frame('avg_no_prd_per_order').reset_index()

del [total_product_per_order]
gc.collect()

avg_number_of_products_per_order.head()
| | user_id | avg_no_prd_per_order |
|---|---|---|
| 0 | 1 | 5.9 |
| 1 | 2 | 13.9286 |
| 2 | 3 | 7.33333 |
| 3 | 4 | 3.6 |
| 4 | 5 | 9.25 |
from scipy import stats
import pandas as pd
import numpy as np

def calculate_mode(x):
    if len(x) > 0:
        mode_result = stats.mode(x)
        # Depending on the SciPy version, .mode is either an array or a scalar
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA

order_most_dow = prior_orders.groupby(by=['user_id'])['order_dow'].aggregate(calculate_mode).to_frame('dow_with_most_orders').reset_index()

order_most_dow.head()
| | user_id | dow_with_most_orders |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 2 |
| 2 | 3 | 0 |
| 3 | 4 | 4 |
| 4 | 5 | 3 |
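The most frequent day per user can also be computed with pandas alone. The sketch below is an alternative to the SciPy-based helper, not the tutorial's method; it runs on made-up data and resolves ties by taking the smallest tied value, which is an explicit choice made here.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 2],
    "order_dow": [4, 4, 1, 2, 5, 5, 2],
})

# Series.mode() returns all tied values in ascending order; take the first one.
dow_mode = (
    toy.groupby("user_id")["order_dow"]
    .agg(lambda s: s.mode().iloc[0])
    .to_frame("dow_with_most_orders")
    .reset_index()
)
print(dow_mode)
```

User 2 orders equally often on days 2 and 5, so the result depends on the tie rule; whichever rule is used, it should be applied consistently to training and prediction data.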
def calculate_mode_hour(x):
    if len(x) > 0:
        mode_result = stats.mode(x)
        # Depending on the SciPy version, .mode is either an array or a scalar
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA

order_most_hod = prior_orders.groupby(by=['user_id'])['order_hour_of_day'].aggregate(calculate_mode_hour).to_frame('hod_with_most_orders').reset_index()

order_most_hod.head()
| | user_id | hod_with_most_orders |
|---|---|---|
| 0 | 1 | 7 |
| 1 | 2 | 9 |
| 2 | 3 | 16 |
| 3 | 4 | 15 |
| 4 | 5 | 18 |
user_reorder_ratio = prior_orders.groupby(by='user_id')['reordered'].aggregate('mean').to_frame('reorder_ratio').reset_index()
user_reorder_ratio['reorder_ratio'] = user_reorder_ratio['reorder_ratio'].astype(np.float16)
user_reorder_ratio.head()
| | user_id | reorder_ratio |
|---|---|---|
| 0 | 1 | 0.694824 |
| 1 | 2 | 0.476807 |
| 2 | 3 | 0.625 |
| 3 | 4 | 0.055542 |
| 4 | 5 | 0.378418 |
users = users.merge(avg_number_of_products_per_order, on='user_id', how='left')
users = users.merge(order_most_dow, on='user_id', how='left')
users = users.merge(order_most_hod, on='user_id', how='left')
users = users.merge(user_reorder_ratio, on='user_id', how='left')

users.head()
| | user_id | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 |
| 1 | 2 | 14 | 13.9286 | 13.9286 | 2 | 9 | 0.476807 |
| 2 | 3 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 |
| 3 | 4 | 5 | 3.6 | 3.6 | 4 | 15 | 0.055542 |
| 4 | 5 | 4 | 9.25 | 9.25 | 3 | 18 | 0.378418 |
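The `_x` and `_y` suffixes in the output above are what pandas adds when both sides of a merge already contain a column with the same name. This typically happens when a merge cell is executed more than once; a single clean run of the code would give one `avg_no_prd_per_order` column. A minimal sketch on made-up data, with `once`, `twice` and `clean` as illustrative names:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "num_of_orders_for_each_user": [10, 14]})
avg = pd.DataFrame({"user_id": [1, 2], "avg_no_prd_per_order": [5.9, 13.9286]})

once = users.merge(avg, on="user_id", how="left")
twice = once.merge(avg, on="user_id", how="left")  # same merge applied a second time

print(list(once.columns))
print(list(twice.columns))

# Dropping the duplicate and restoring the original name recovers the clean frame.
clean = twice.drop(columns="avg_no_prd_per_order_y").rename(
    columns={"avg_no_prd_per_order_x": "avg_no_prd_per_order"}
)
```

The two suffixed columns carry identical values, so the model gains nothing from having both.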
del [avg_number_of_products_per_order, order_most_dow, order_most_hod, user_reorder_ratio]
gc.collect()
Create features using product_id.
purchased_num_of_times = prior_orders.groupby(by='product_id')['order_id'].aggregate('count').to_frame('purchased_num_of_times').reset_index()
purchased_num_of_times.head()
| | product_id | purchased_num_of_times |
|---|---|---|
| 0 | 1 | 1852 |
| 1 | 2 | 90 |
| 2 | 3 | 277 |
| 3 | 4 | 329 |
| 4 | 5 | 15 |
product_reorder_ratio = prior_orders.groupby(by='product_id')['reordered'].aggregate('mean').to_frame('product_reorder_ratio').reset_index()
product_reorder_ratio.head()
| | product_id | product_reorder_ratio |
|---|---|---|
| 0 | 1 | 0.613391 |
| 1 | 2 | 0.133333 |
| 2 | 3 | 0.732852 |
| 3 | 4 | 0.446809 |
| 4 | 5 | 0.6 |
add_to_cart = prior_orders.groupby(by='product_id')['add_to_cart_order'].aggregate('mean').to_frame('product_avg_cart_addition').reset_index()
add_to_cart.head()
| | product_id | product_avg_cart_addition |
|---|---|---|
| 0 | 1 | 5.80184 |
| 1 | 2 | 9.88889 |
| 2 | 3 | 6.41516 |
| 3 | 4 | 9.5076 |
| 4 | 5 | 6.46667 |
purchased_num_of_times = purchased_num_of_times.merge(product_reorder_ratio, on='product_id', how='left')
purchased_num_of_times = purchased_num_of_times.merge(add_to_cart, on='product_id', how='left')

del [product_reorder_ratio, add_to_cart]
gc.collect()

purchased_num_of_times.head()
| | product_id | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|
| 0 | 1 | 1852 | 0.613391 | 5.80184 |
| 1 | 2 | 90 | 0.133333 | 9.88889 |
| 2 | 3 | 277 | 0.732852 | 6.41516 |
| 3 | 4 | 329 | 0.446809 | 9.5076 |
| 4 | 5 | 15 | 0.6 | 6.46667 |
Creating features using user_id and product_id
user_product_data = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_times_bought').reset_index()
user_product_data.head()
| | user_id | product_id | uxp_times_bought |
|---|---|---|---|
| 0 | 1 | 196 | 10 |
| 1 | 1 | 10258 | 9 |
| 2 | 1 | 10326 | 1 |
| 3 | 1 | 12427 | 10 |
| 4 | 1 | 13032 | 3 |
product_first_order_num = prior_orders.groupby(by=['user_id', 'product_id'])['order_number'].aggregate('min').to_frame('first_order_number').reset_index()
product_first_order_num.head()
| | user_id | product_id | first_order_number |
|---|---|---|---|
| 0 | 1 | 196 | 1 |
| 1 | 1 | 10258 | 2 |
| 2 | 1 | 10326 | 5 |
| 3 | 1 | 12427 | 1 |
| 4 | 1 | 13032 | 2 |
total_orders = prior_orders.groupby('user_id')['order_number'].max().to_frame('total_orders').reset_index()
total_orders.head()
| | user_id | total_orders |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 2 | 14 |
| 2 | 3 | 12 |
| 3 | 4 | 5 |
| 4 | 5 | 4 |
user_product_df = pd.merge(total_orders, product_first_order_num, on='user_id', how='right')
user_product_df.head()
| | user_id | total_orders | product_id | first_order_number |
|---|---|---|---|---|
| 0 | 1 | 10 | 196 | 1 |
| 1 | 1 | 10 | 10258 | 2 |
| 2 | 1 | 10 | 10326 | 5 |
| 3 | 1 | 10 | 12427 | 1 |
| 4 | 1 | 10 | 13032 | 2 |
user_product_df['order_range'] = user_product_df['total_orders'] - user_product_df['first_order_number'] + 1
user_product_df.head()
| | user_id | total_orders | product_id | first_order_number | order_range |
|---|---|---|---|---|---|
| 0 | 1 | 10 | 196 | 1 | 10 |
| 1 | 1 | 10 | 10258 | 2 | 9 |
| 2 | 1 | 10 | 10326 | 5 | 6 |
| 3 | 1 | 10 | 12427 | 1 | 10 |
| 4 | 1 | 10 | 13032 | 2 | 9 |
number_of_times = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('times_bought').reset_index()
number_of_times.head()
| | user_id | product_id | times_bought |
|---|---|---|---|
| 0 | 1 | 196 | 10 |
| 1 | 1 | 10258 | 9 |
| 2 | 1 | 10326 | 1 |
| 3 | 1 | 12427 | 10 |
| 4 | 1 | 13032 | 3 |
uxp_ratio = pd.merge(number_of_times, user_product_df, on=['user_id', 'product_id'], how='left')
uxp_ratio.head()
| | user_id | product_id | times_bought | total_orders | first_order_number | order_range |
|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 10 | 1 | 10 |
| 1 | 1 | 10258 | 9 | 10 | 2 | 9 |
| 2 | 1 | 10326 | 1 | 10 | 5 | 6 |
| 3 | 1 | 12427 | 10 | 10 | 1 | 10 |
| 4 | 1 | 13032 | 3 | 10 | 2 | 9 |
uxp_ratio['uxp_reorder_ratio'] = uxp_ratio['times_bought'] / uxp_ratio['order_range']
uxp_ratio.head()
| | user_id | product_id | times_bought | total_orders | first_order_number | order_range | uxp_reorder_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 10 | 1 | 10 | 1 |
| 1 | 1 | 10258 | 9 | 10 | 2 | 9 | 1 |
| 2 | 1 | 10326 | 1 | 10 | 5 | 6 | 0.166667 |
| 3 | 1 | 12427 | 10 | 10 | 1 | 10 | 1 |
| 4 | 1 | 13032 | 3 | 10 | 2 | 9 | 0.333333 |
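The ratio above divides the number of times a user bought a product by the number of orders placed since the first purchase of that product, where `order_range = total_orders - first_order_number + 1`. The arithmetic can be checked against the table rows for user 1; `uxp_reorder_ratio` below is a hypothetical helper written only for this check.

```python
def uxp_reorder_ratio(times_bought, total_orders, first_order_number):
    """Share of orders containing the product, counted from the user's first purchase of it."""
    order_range = total_orders - first_order_number + 1
    return times_bought / order_range


# Values taken from the uxp_ratio table above (user 1).
print(uxp_reorder_ratio(10, 10, 1))            # product 196   -> 1.0
print(round(uxp_reorder_ratio(1, 10, 5), 6))   # product 10326 -> 0.166667
print(round(uxp_reorder_ratio(3, 10, 2), 6))   # product 13032 -> 0.333333
```

A product bought once in order 5 of 10 had six chances to be bought, which gives 1/6 rather than 1/10.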
uxp_ratio.drop(['times_bought', 'total_orders', 'first_order_number', 'order_range'], axis=1, inplace=True)
uxp_ratio.head()
| | user_id | product_id | uxp_reorder_ratio |
|---|---|---|---|
| 0 | 1 | 196 | 1 |
| 1 | 1 | 10258 | 1 |
| 2 | 1 | 10326 | 0.166667 |
| 3 | 1 | 12427 | 1 |
| 4 | 1 | 13032 | 0.333333 |
user_product_data = user_product_data.merge(uxp_ratio, on=['user_id', 'product_id'], how='left')

del [product_first_order_num, number_of_times, user_product_df, total_orders, uxp_ratio]
gc.collect()

user_product_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio |
|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 |
| 1 | 1 | 10258 | 9 | 1 |
| 2 | 1 | 10326 | 1 | 0.166667 |
| 3 | 1 | 12427 | 10 | 1 |
| 4 | 1 | 13032 | 3 | 0.333333 |
# Count orders backwards from each user's most recent prior order (1 = most recent)
prior_orders['order_number_back'] = prior_orders.groupby(by=['user_id'])['order_number'].transform('max') - prior_orders.order_number + 1
prior_orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 | 10 |
| 1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 | 10 |
| 2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 | 10 |
| 3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 | 10 |
| 4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 | 10 |
temp_df = prior_orders.loc[prior_orders.order_number_back <= 3]
temp_df.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 12427 | 1 | 1 | 3 |
| 39 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 196 | 2 | 1 | 3 |
| 40 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 10258 | 3 | 1 | 3 |
| 41 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 25133 | 4 | 1 | 3 |
| 42 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 46149 | 5 | 0 | 3 |
last_three_order = temp_df.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_last_three').reset_index()
last_three_order.head()
| | user_id | product_id | uxp_last_three |
|---|---|---|---|
| 0 | 1 | 196 | 3 |
| 1 | 1 | 10258 | 3 |
| 2 | 1 | 12427 | 3 |
| 3 | 1 | 13032 | 1 |
| 4 | 1 | 25133 | 3 |
last_three_order['uxp_ratio_last_three'] = last_three_order['uxp_last_three'] / 3
last_three_order.head()
| | user_id | product_id | uxp_last_three | uxp_ratio_last_three |
|---|---|---|---|---|
| 0 | 1 | 196 | 3 | 1 |
| 1 | 1 | 10258 | 3 | 1 |
| 2 | 1 | 12427 | 3 | 1 |
| 3 | 1 | 13032 | 1 | 0.333333 |
| 4 | 1 | 25133 | 3 | 1 |
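The recency features can be traced end to end on a small example. The sketch below uses one made-up user with four prior orders and follows the same steps: number the orders backwards, keep the last three, count each product, and divide by 3.

```python
import pandas as pd

# One user, four prior orders; each row is one product in one order.
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1, 1, 1],
    "order_number": [1, 2, 3, 4, 2, 3, 4],
    "product_id":   [196, 196, 196, 196, 13032, 10258, 10258],
    "order_id":     [11, 12, 13, 14, 12, 13, 14],
})

# 1 = most recent prior order, 2 = the one before it, and so on.
toy["order_number_back"] = (
    toy.groupby("user_id")["order_number"].transform("max") - toy["order_number"] + 1
)

# Keep only the three most recent orders and count each product in them.
recent = toy.loc[toy["order_number_back"] <= 3]
last_three = (
    recent.groupby(["user_id", "product_id"])["order_id"]
    .count()
    .to_frame("uxp_last_three")
    .reset_index()
)
last_three["uxp_ratio_last_three"] = last_three["uxp_last_three"] / 3
print(last_three)
```

Products that do not appear in the last three orders are absent from `last_three`, which is why the later left merge produces missing values that are then filled with 0.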
user_product_data = user_product_data.merge(last_three_order, on=['user_id', 'product_id'], how='left')

del [last_three_order, temp_df]
gc.collect()

user_product_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three |
|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 | 3 | 1 |
| 1 | 1 | 10258 | 9 | 1 | 3 | 1 |
| 2 | 1 | 10326 | 1 | 0.166667 | nan | nan |
| 3 | 1 | 12427 | 10 | 1 | 3 | 1 |
| 4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 |
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 8382738
uxp_ratio_last_three 8382738
dtype: int64
user_product_data.fillna(0, inplace=True)

user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
dtype: int64
Create final dataframe for engineered features
featured_engineered_data = user_product_data.merge(users, on='user_id', how='left')
featured_engineered_data = featured_engineered_data.merge(purchased_num_of_times, on='product_id', how='left')

del [users, user_product_data, purchased_num_of_times]
gc.collect()

featured_engineered_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 |
| 1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 |
| 2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 |
| 3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 |
| 4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 |
featured_engineered_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
dtype: int64
Creating Train and Test datasets
Create training dataset
orders_future = orders[((orders.eval_set=='train') | (orders.eval_set=='test'))]
orders_future = orders_future[['user_id', 'eval_set', 'order_id']]

final_data = featured_engineered_data.merge(orders_future, on='user_id', how='left')
final_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
| 1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
| 2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
| 3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
| 4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = final_data[final_data.eval_set=='train']
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
| 1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
| 2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
| 3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
| 4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = train_data.merge(order_products_train[['product_id', 'order_id', 'reordered']], on=['product_id', 'order_id'], how='left')
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
| 1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
| 2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | nan |
| 3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | nan |
| 4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
eval_set 0
order_id 0
reordered 7645837
dtype: int64
train_data['reordered'] = train_data['reordered'].fillna(0)
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
| 1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
| 2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | 0 |
| 3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | 0 |
| 4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data = train_data.set_index(['user_id', 'product_id'])

train_data = train_data.drop(['eval_set', 'order_id'], axis=1)
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | 1 |
| 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | 1 |
| 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | 0 |
| 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | 0 |
| 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | 1 |
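The target is built by matching each candidate (user, product) pair from the prior orders against the user's train order. Pairs found in the train order keep their `reordered` value; pairs with no match get a missing value, which is then set to 0. A minimal sketch on made-up frames (`candidates`, `train_order`) that mirrors this step:

```python
import pandas as pd

# Candidate (user, product) pairs from prior orders, tagged with the user's train order.
candidates = pd.DataFrame({
    "user_id": [1, 1, 1],
    "product_id": [196, 10326, 13032],
    "order_id": [1187899, 1187899, 1187899],
})
# Products actually present in that train order.
train_order = pd.DataFrame({
    "order_id": [1187899, 1187899],
    "product_id": [196, 13032],
    "reordered": [1, 1],
})

labelled = candidates.merge(train_order, on=["product_id", "order_id"], how="left")
# Pairs with no match were not bought again, so their label becomes 0.
labelled["reordered"] = labelled["reordered"].fillna(0).astype("int8")
print(labelled)
```

Because most previously bought products are not in any given next order, the resulting target is imbalanced toward 0, as the 7,645,837 missing values above already suggest.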
Create testing dataset
test_data = final_data[final_data.eval_set=='test']
test_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | 3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | test | 2774568 |
| 121 | 3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | test | 2774568 |
| 122 | 3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | test | 2774568 |
| 123 | 3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | test | 2774568 |
| 124 | 3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | test | 2774568 |
test_data = test_data.set_index(['user_id', 'product_id'])

test_data = test_data.drop(['eval_set', 'order_id'], axis=1)

test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 |
| 3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 |
| 3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 |
| 3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 |
| 3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 |
del [final_data, orders_future, products, order_products_train]
gc.collect()
Building model using Xplainable Classifier
Build X_train and y_train dataset
X_train, y_train = train_data.drop('reordered', axis=1), train_data.reordered
2. Model Optimization
Optimize hyperparameters using a subset of the data for computational efficiency.
from xplainable.core.optimisation.bayesian import XParamOptimiser

opt = XParamOptimiser()
params = opt.optimise(X_train[:1000000], y_train[:1000000])
Out: 100%|████████| 30/30 [00:45<00:00, 1.53s/trial, best loss: -0.8107380819617527]
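The slice above takes the first 1,000,000 rows. If the training rows happen to be ordered (for example by user), a random sample may represent the full set better. The following is a minimal sketch on a toy frame, not part of the original notebook; the helper name `sample_for_tuning` and its parameters `n_rows` and `seed` are hypothetical.

```python
import numpy as np
import pandas as pd


def sample_for_tuning(X, y, n_rows, seed=0):
    """Return a random subset of X and the rows of y at the same positions."""
    rng = np.random.default_rng(seed)
    n_rows = min(n_rows, len(X))
    # Sample row positions without replacement so no row is repeated.
    pos = rng.choice(len(X), size=n_rows, replace=False)
    return X.iloc[pos], y.iloc[pos]


# Toy stand-in for X_train / y_train with a (user_id, product_id) index.
idx = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 11), (2, 10), (2, 12), (3, 13), (3, 14)],
    names=["user_id", "product_id"],
)
X_demo = pd.DataFrame({"uxp_times_bought": [1, 3, 2, 5, 1, 4]}, index=idx)
y_demo = pd.Series([0, 1, 0, 1, 0, 1], index=idx, name="reordered")

X_sub, y_sub = sample_for_tuning(X_demo, y_demo, n_rows=4)
print(len(X_sub), X_sub.index.equals(y_sub.index))
```

With the real data, the two results could be passed to `opt.optimise` in place of the slices.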
3. Model Training
Train the XClassifier with optimized parameters on the full training dataset.
from xplainable.core.models import XClassifier

model = XClassifier(**params)
model.fit(X_train, y_train)
Out: <xplainable.core.ml.classification.XClassifier at 0x2a426f760>
4. Model Interpretability and Explainability
Analyze feature importance and model decision-making for the product reorder
predictions.
In the Feature Importances section, each feature is listed with its importance value.
The feature uxp_reorder_ratio has the highest importance, making it the most
influential factor in the model's predictions.
In the Contributions section, green bars represent positive contributions and red bars
represent negative contributions. The exact contribution values are not directly
visible, but the length and color of the bars suggest that uxp_reorder_ratio has a
strong positive influence on the model's predictions.
5. Model Persisting (Optional)
Save the trained model to the Xplainable platform for future use and deployment.
print("Model persistence step skipped - uncomment above code to save model")
6. Model Deployment (Optional)
Deploy the model to the Xplainable platform for real-time predictions.
print("Model deployment step skipped - uncomment above code to deploy model")
7. Model Testing
Convert the predicted probabilities into binary predictions using a threshold cutoff.
NOTE: Adjust the threshold cutoff (0.21 here) to see its impact on the results.
test_prediction = (model.predict_proba(test_data) >= 0.21).astype(int)
test_prediction[:5]
Out: array([0, 0, 0, 0, 0])
train_prediction = (model.predict_proba(X_train) >= 0.21).astype(int)
train_prediction[:5]
Out: array([1, 1, 0, 1, 0])
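To see how the cutoff affects the result, one option is to compute F1 at several candidate thresholds and compare. The sketch below uses made-up labels and scores rather than the notebook's data, and the helper `f1_at_threshold` is a hypothetical name. In practice the comparison is better done on held-out data than on the training set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when scores >= threshold are labelled 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels and scores standing in for y_train and predict_proba output.
y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
scores_demo = [0.05, 0.15, 0.18, 0.22, 0.30, 0.45, 0.50, 0.80]

thresholds = [0.10, 0.21, 0.40, 0.60]
results = {t: f1_at_threshold(y_demo, scores_demo, t) for t in thresholds}
best_threshold = max(results, key=results.get)

for t, f1 in results.items():
    print(f"threshold={t:.2f}  f1={f1:.3f}")
print("best threshold:", best_threshold)
```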
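from sklearn.metrics import f1_score, classification_report

# scikit-learn metrics expect (y_true, y_pred). Binary F1 is symmetric in its
# arguments, so the score is the same in either order.
print(f'f1 Score: {f1_score(y_train, train_prediction)}')
# The report receives the predictions first, so its precision and recall columns
# are swapped and support counts predicted labels rather than true labels.
print(classification_report(train_prediction, y_train))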
Out: f1 Score: 0.41808008442097677
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.92 | 0.94 | 0.93 | 7520042 |
| 1 | 0.45 | 0.39 | 0.42 | 954619 |
| accuracy | | | 0.88 | 8474661 |
| macro avg | 0.69 | 0.66 | 0.67 | 8474661 |
| weighted avg | 0.87 | 0.88 | 0.87 | 8474661 |
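scikit-learn's metric functions take the true labels first and the predictions second. The classification_report call above passes the predictions first, which swaps the precision and recall columns while leaving F1 unchanged. The sketch below illustrates this with hand-computed metrics on made-up labels; `precision_recall` and `f1` are hypothetical helpers, not library functions.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for class 1, given (y_true, y_pred) in that order."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up data: 4 true positives in the labels, 3 predicted positives.
labels_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

correct = precision_recall(labels_demo, preds_demo)
swapped = precision_recall(preds_demo, labels_demo)

print("correct order (precision, recall):", correct, "f1:", f1(*correct))
print("swapped order (precision, recall):", swapped, "f1:", f1(*swapped))
```

Read this way, the class 1 row of the report above corresponds to a precision of 0.39 and a recall of 0.45.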
test_data['prediction'] = test_prediction
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | 0 |
| 3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | 0 |
| 3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | 0 |
| 3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | 0 |
| 3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | 0 |
# Move user_id and product_id back into regular columns
final_df = test_data.reset_index()

# Keep only the columns needed for the submission
final_df = final_df[['product_id', 'user_id', 'prediction']]

gc.collect()
final_df.head()
| | product_id | user_id | prediction |
|---|---|---|---|
| 0 | 248 | 3 | 0 |
| 1 | 1005 | 3 | 0 |
| 2 | 1819 | 3 | 0 |
| 3 | 7503 | 3 | 0 |
| 4 | 8021 | 3 | 0 |
Creating the Kaggle submission file (optional)
After training the model and checking its performance, the next step is to prepare a
submission for Kaggle. This step is optional, but it is good practice to understand how
to create a submission file that adheres to the competition's requirements.
To create a submission file, you typically need to:
- Ensure that your model has been trained on the full training set or with an
appropriate cross-validation strategy.
- Generate predictions for the test set provided by Kaggle.
- Format these predictions into a CSV file that matches the competition's submission
format, which usually means an id column plus a column with your predictions.
- Use the to_csv() function from pandas with the appropriate parameters, such as
index=False if the index should not be included in the submission file, to save
your dataframe to a CSV file.
- Upload this CSV file to the Kaggle competition's submission page to see how your
model performs on the unseen test set.
The specific steps for building the submission file follow.
# Look up the test order_id for each user
orders_test = orders.loc[orders.eval_set == 'test', ['user_id', 'order_id']]
orders_test.head()
| | user_id | order_id |
|---|---|---|
| 38 | 3 | 2.77457e+06 |
| 44 | 4 | 329954 |
| 53 | 6 | 1.52801e+06 |
| 96 | 11 | 1.37694e+06 |
| 102 | 12 | 1.35684e+06 |
final_df = final_df.merge(orders_test, on='user_id', how='left')
final_df.head()
| | product_id | user_id | prediction | order_id |
|---|---|---|---|---|
| 0 | 248 | 3 | 0 | 2.77457e+06 |
| 1 | 1005 | 3 | 0 | 2.77457e+06 |
| 2 | 1819 | 3 | 0 | 2.77457e+06 |
| 3 | 7503 | 3 | 0 | 2.77457e+06 |
| 4 | 8021 | 3 | 0 | 2.77457e+06 |
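The left merge above relies on each user having exactly one test order. A small sketch of how that assumption could be checked, using pandas' `validate` argument on toy frames (the `_demo` frames and their values are illustrative only):

```python
import pandas as pd

# Toy stand-ins for final_df and orders_test.
predictions_demo = pd.DataFrame({
    "product_id": [248, 1005, 196],
    "user_id": [3, 3, 4],
    "prediction": [0, 1, 1],
})
orders_test_demo = pd.DataFrame({
    "user_id": [3, 4],
    "order_id": [2774568, 329954],
})

# validate="many_to_one" raises MergeError if a user_id appears more than
# once in the right-hand frame, which would otherwise duplicate prediction rows.
merged = predictions_demo.merge(
    orders_test_demo, on="user_id", how="left", validate="many_to_one"
)

# A left merge leaves NaN where no test order matched; check there are none.
assert merged["order_id"].notna().all()
print(merged)
```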
final_df = final_df.drop('user_id', axis=1)
final_df['product_id'] = final_df.product_id.astype(int)
final_df.head()
| | product_id | prediction | order_id |
|---|---|---|---|
| 0 | 248 | 0 | 2.77457e+06 |
| 1 | 1005 | 0 | 2.77457e+06 |
| 2 | 1819 | 0 | 2.77457e+06 |
| 3 | 7503 | 0 | 2.77457e+06 |
| 4 | 8021 | 0 | 2.77457e+06 |
# Collect the predicted product ids for each order as a space-separated string
final_dict = dict()
for row in final_df.itertuples():
    if row.prediction == 1:
        try:
            final_dict[row.order_id] += ' ' + str(row.product_id)
        except KeyError:
            final_dict[row.order_id] = str(row.product_id)

# Orders with no predicted reorders are submitted as 'None'
for order in final_df.order_id:
    if order not in final_dict:
        final_dict[order] = 'None'

gc.collect()
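The same per-order aggregation can also be written with a pandas groupby, which avoids the explicit Python loop. This is an alternative sketch on toy data, not the notebook's method; the `demo` frame and its values are made up.

```python
import pandas as pd

# Toy stand-in for final_df.
demo = pd.DataFrame({
    "order_id": [11, 11, 11, 22, 22, 33],
    "product_id": [248, 1005, 1819, 196, 7503, 8021],
    "prediction": [1, 0, 1, 0, 0, 1],
})

# Join the predicted product ids of each order into one space-separated string.
predicted = (
    demo.loc[demo["prediction"] == 1]
    .groupby("order_id")["product_id"]
    .agg(lambda ids: " ".join(str(i) for i in ids))
)

# Reindex over every order so those without a positive prediction get 'None'.
all_orders = demo["order_id"].unique()
products_by_order = predicted.reindex(all_orders).fillna("None")

print(products_by_order.to_dict())
```

Unlike the dictionary built above, the result keeps orders in their order of first appearance rather than listing the 'None' orders last.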
submission_df = pd.DataFrame.from_dict(final_dict, orient='index')

# Move order_id out of the index into a regular column
submission_df.reset_index(inplace=True)

submission_df.columns = ['order_id', 'products']

submission_df.head()
| | order_id | products |
|---|---|---|
| 0 | 2774568 | 17668 18599 21903 39190 43961 47766 |
| 1 | 1528013 | 21903 38293 |
| 2 | 1376945 | 8309 13176 14947 20383 27959 33572 35948 44632 |
| 3 | 1356845 | 5746 7076 8239 10863 11520 13176 14992 |
| 4 | 2161313 | 196 10441 11266 12427 14715 27839 37710 |
submission_df.to_csv('sub.csv', index=False, header=True)
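Before uploading, it can help to read the file back and check its shape. The sketch below assumes the format built above (an `order_id` column and a `products` column holding either `None` or space-separated product ids) and additionally assumes that `order_id` should be written as a plain integer. The function `check_submission` is a hypothetical helper, and the demo files are written to a temporary directory.

```python
import csv
import os
import tempfile


def check_submission(path):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    seen = set()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != ["order_id", "products"]:
            problems.append(f"unexpected header: {header}")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 2:
                problems.append(f"line {line_no}: expected 2 fields")
                continue
            order_id, products = row
            # Assumption: ids are plain integers, so '2774568.0' is flagged.
            if not order_id.isdigit():
                problems.append(f"line {line_no}: non-integer order_id {order_id!r}")
            if order_id in seen:
                problems.append(f"line {line_no}: duplicate order_id {order_id}")
            seen.add(order_id)
            tokens = products.split()
            if products != "None" and (not tokens or not all(t.isdigit() for t in tokens)):
                problems.append(f"line {line_no}: bad products value {products!r}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "sub.csv")
    with open(good_path, "w", newline="") as f:
        f.write("order_id,products\n2774568,17668 18599\n329954,None\n")
    bad_path = os.path.join(tmp, "bad.csv")
    with open(bad_path, "w", newline="") as f:
        f.write("order_id,products\n2774568.0,17668\n2774568.0,\n")
    good_problems = check_submission(good_path)
    bad_problems = check_submission(bad_path)

print(good_problems)
for problem in bad_problems:
    print(problem)
```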